The claims data of the Health Insurance Review and Assessment Service (HIRA) is an important source of information for healthcare service research. The data is collected when healthcare service providers submit a claim to HIRA to be reimbursed for a service that they provided to patients. To improve researchers' access to the claims data, HIRA has developed the Patient Samples, which are extracted using a stratified randomized sampling method. The Patient Samples consist of five tables: a table for general information (Table 20) containing socio-demographic information such as gender, age, and medical aid, and indicators for inpatient and outpatient services; a table for specific information on healthcare services provided (Table 30); a table for diagnostic information (Table 40); a table for outpatient prescriptions (Table 53); and a table for information on healthcare service providers (Table of Providers). Researchers who are interested in using the Patient Sample data for research can apply via HIRA's website (https://www.hira.or.kr).

Correspondence: Jee-Ae Kim
Health Insurance Review and Assessment Service, 267 Hyoryeong-ro, Seocho-gu, Seoul 137-927, Korea
Tel: +82-2-2182-2600, Fax: +82-2-6710-5836, E-mail: kja0813@hiramail.net
Received: Apr 14, 2014, Accepted: Jul 29, 2014, Published: Jul 30, 2014
This article is available from: http://e-epih.org/
© 2014, Korean Society of Epidemiology
This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

INTRODUCTION

South Korea has a universal health coverage system in which the National Health Insurance covers approximately 98% of the overall Korean population. The claims data of the Health Insurance Review and Assessment Service (HIRA) covers 46 million patients per year, accounting for 90% of the total population in Korea, and includes claims from almost 80,000 healthcare service providers across South Korea as of 2011. The claims data of HIRA includes patients' diagnoses, treatments, procedures, surgical history, and prescription drugs, which makes it a valuable resource for healthcare service research. However, the complex structure and vast volume of the claims data require considerable effort on the part of the researcher to understand them. In addition, applications to use the claims data of HIRA and their deliberation process take a significant amount of time, and these processes restrict researchers from accessing the data. Furthermore, the vast volume of the data can make research inefficient. To resolve these limitations and improve accessibility for researchers to use the claims data, HIRA has developed the Patient Sample data, which have passed validity tests performed by five different institutions.

The Patient Samples are stratified random samples of the claims data of HIRA. The sample sizes were carefully calculated, and the samples extracted, to improve representativeness of the socio-demographic characteristics, diagnoses, and healthcare services, including prescription drugs, of Korean patients on a yearly basis. HIRA provides four samples covering different patient groups, each extracted from a specific segment of the data, in order to enhance reliability and representativeness. As the claims data of HIRA consist of 10% inpatients and 90% outpatients, the National Patient Sample (NPS) may not have enough cases to investigate inpatient services for severe health conditions. To support research on groups whose representativeness is not ensured in the NPS data, four separately drawn samples are currently available: the National Patient Sample (HIRA-NPS), National Inpatient Sample (HIRA-NIS), Adult Patient Sample (HIRA-APS), and Pediatric Patient Sample (HIRA-PPS) (Table 1).

The Patient Samples are updated annually because they are developed through the accumulation of claims data over a year-long cycle. However, the Patient Samples are cross-sectional, and different patients are selected each year to protect their privacy. As such, a specific individual or healthcare service provider cannot be followed longitudinally, even within the same type of Patient Sample. Therefore, research that requires long-term follow-up of patients cannot be conducted with the Patient Samples.

Case study from abroad 

In the US, a federal research institution called the Agency for Healthcare Research and Quality (AHRQ) performs and supports research associated with healthcare services. AHRQ collects data from 37 state governments, communities, and healthcare industry organizations to form a healthcare database. One of AHRQ's research programs, the Healthcare Cost and Utilization Project (HCUP), constitutes the largest healthcare database in the US. Among the data provided by HCUP, the National Inpatient Sample (NIS) is the most comprehensive inpatient data set, which includes all the healthcare communities within the American Hospital Association, except for physical therapy institutions. The NIS is based on data collected from 3,900 member healthcare institutions spanning 37 states. Approximately 20% of the member institutions (800-1,100 institutions) are sampled, and all of the inpatient data (5-8 million cases) from the sampled institutions are included (Table 2).



Data resource area and population coverage 

The claims data of HIRA is collected when healthcare service providers in South Korea seek reimbursement for healthcare services that the National Health Insurance Corporation agrees to cover. Health insurance claims are submitted for approximately 46 million Korean patients each year. The claims data of HIRA is national data compiled from healthcare providers across the country and corresponds to the claims submitted for these patients. In addition, claims for patients in the medical aid program, patients supported by government expenditure, and veteran patients are also included in the claims data.

MEASURES 

Extraction method 

Because the criteria for classifying claims submitted to HIRA are clearly defined, the Patient Samples adopted stratified sampling, a probabilistic sample extraction method. Using two stratification variables, sex (2 strata) and age (16 strata), the data were divided into a total of 32 strata before random extraction. Demographic stratification of the claims data at the patient level secures representativeness while accounting for the time-series features of the data, which differ by healthcare service setting (inpatient or outpatient), the cycle of claims submission from providers (daily or monthly), and type of disease.
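
The stratified extraction described above can be illustrated with a minimal Python sketch. It is not HIRA's actual procedure: the synthetic population, the field names (`sex`, `age_group`, `weight`), the use of one common sampling fraction in every stratum, and the definition of the weight as the inverse of that fraction are all assumptions made for illustration. The 16 age groups are represented only by the indices 0-15 because their boundaries are not stated here.

```python
import random
from collections import defaultdict


def stratified_sample(patients, rate, seed=0):
    """Draw the same fraction at random from every sex-by-age stratum.

    Assumption: proportional allocation; the weight attached to each
    sampled patient is stratum size / number sampled in that stratum.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for p in patients:
        strata[(p["sex"], p["age_group"])].append(p)

    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = max(1, round(len(members) * rate))  # at least one per stratum
        for p in rng.sample(members, k):
            sample.append({**p, "weight": len(members) / k})
    return sample


# Synthetic population: 2 sexes x 16 age groups = 32 strata, 100 patients each.
population = [
    {"patient_id": f"{sex}{age_group:02d}-{i:03d}", "sex": sex, "age_group": age_group}
    for sex in ("M", "F")
    for age_group in range(16)
    for i in range(100)
]

sample = stratified_sample(population, rate=0.03)
strata_in_sample = {(p["sex"], p["age_group"]) for p in sample}
print(len(sample), "patients sampled from", len(strata_in_sample), "strata")
```

Because every stratum contributes to the sample, no sex-by-age group can be missed by chance, which is the property the stratification is meant to guarantee.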



Table 1. Sample sizes and computation of each sample data

Sample type                                  Computation standard
HIRA-National Inpatient Sample (2009-2011)   700,000 inpatients per year (13%), approximately
                                             400,000 outpatients per year (1%)
HIRA-National Patient Sample (2010-2011)     1.4 million patients overall per year (3%)
HIRA-Adult Patient Sample (2010-2011)        Approximately 1 million patients over the age of 65
                                             per year (20%)
HIRA-Pediatric Patient Sample (2010-2011)    Approximately 1.1 million patients under the age of
                                             20 per year (10%)

The size of each sample data is either 1.5 million patients or 20% of the total number in the corresponding area of the original data.
HIRA, Health Insurance Review and Assessment Service.



Table 2. Comparison of the nations' sample datasets

Country-based comparison  Korea (HIRA)                            US (AHRQ)                   Taiwan (NHIRD)
Sampling unit             Patient sampling                        Hospital sampling           Patient sampling
Unit of data provided     Patient based                           Institution case-based      Patient data
                                                                  (discharge data)
Stratification variable   Demographic characteristics             Hospital characteristics,   Simple random sampling
                          (sex, age group)                        geographic location
Data provided to          All researchers                         All researchers             National research institutions
                                                                                              and researchers (general public
                                                                                              uses the educational data set)
Sample size               Inpatients, all patients, pediatric     Approximately 7 million     Health-plan registrants,
                          and teen patients, elderly patients;    institutional cases         1 million people
                          under 1.5 million people per category   (inpatient data)

HIRA, Health Insurance Review and Assessment Service; AHRQ, Agency for Healthcare Research and Quality; NHIRD, National Health Insurance Research Database.

Because the charge in the National Health Insurance claims data exhibits the maximum variance and best reflects the characteristics of the claims data, the charge was chosen as the sampling variable. Under the assumptions of an acceptable sampling error range and a normal distribution, the standard deviation and the sample size were calculated with the following equation.

Using the above equation, the patient sample sizes that best reflect the representativeness of the overall claims data were determined as shown in Table 3. Table 3 compares the population estimated by applying sample weights to the HIRA sample data with the actual population. The estimated population and the actual population exhibited a 95% concordance, demonstrating a high level of representativeness.
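
The comparison reported in Table 3 can be reproduced with a few lines of Python. The sketch below takes the estimated and actual populations from Table 3 (2011 data) and computes their ratio for each sample. Treating "concordance" as the simple ratio of estimated to actual population is an assumption, as are the function names; the small weights passed to `estimated_population` are made-up values that only show how a weighted estimate is formed.

```python
def estimated_population(weights):
    """Weighted estimate: each sampled patient stands for `weight` patients."""
    return sum(weights)


def concordance(estimated, actual):
    """Ratio of the weighted estimate to the actual population (assumed metric)."""
    return estimated / actual


# Toy illustration of the weighted estimate (made-up weights).
toy_estimate = estimated_population([33.0, 33.5, 34.0])

# Figures from Table 3: (estimated population, actual population).
table3 = {
    "HIRA-NPS": (45_861_321, 47_026_505),
    "HIRA-NIS": (5_888_921, 6_026_063),
    "HIRA-APS": (5_365_917, 5_650_511),
    "HIRA-PPS": (10_266_474, 10_681_503),
}

ratios = {name: concordance(est, act) for name, (est, act) in table3.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%}")
```

Under this reading, all four ratios fall between roughly 95% and 98%, with the Adult Patient Sample the lowest, which is consistent with the concordance stated above.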

Explanation of variables 

The Patient Samples consist of five tables: Table 20 (general specifications), Table 30 (health services), Table 40 (diagnosis information), Table 53 (outpatient prescriptions), and Table of Providers (healthcare service provider information).

All tables are linkable using a key ID. Table 20 (general specifications) includes the general characteristics of the patient, such as socio-demographic characteristics (gender, age, and medical aid program), major diagnosis, secondary diagnosis, payer's amount, and patient's out-of-pocket cost. Table 30 (health services) includes details of the inpatient and outpatient healthcare services provided to patients, such as procedures, treatments, and prescription drugs for inpatients. Table 40 (diagnosis information) contains all of the diagnoses that patients have had. Table 40 is used when a patient's concomitant diseases or a history of all conditions is needed. Table 53 (outpatient prescriptions) contains information on drugs prescribed to outpatients, such as active ingredients, dosage, and days of supply. Finally, Table of Providers includes information about the healthcare service provider that the patient visited, including the type of provider (primary care, secondary care, tertiary care), location, size, and ownership type (Table 4).
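
As a rough illustration of how the tables link through the key ID, the following Python sketch attaches the diagnosis rows of Table 40 to the matching billing statements of Table 20. The rows, the field names (`key_id`, `main_dx`, `dx`, `is_main`), and the helper `link_by_key` are hypothetical and heavily simplified; the actual column names in the Patient Samples are not given here.

```python
from collections import defaultdict

# Hypothetical, simplified rows standing in for Table 20 and Table 40.
table20 = [
    {"key_id": "B001", "patient_id": "P1", "sex": "F", "age": 67, "main_dx": "E11"},
    {"key_id": "B002", "patient_id": "P2", "sex": "M", "age": 45, "main_dx": "J20"},
]
table40 = [
    {"key_id": "B001", "dx": "E11", "is_main": True},
    {"key_id": "B001", "dx": "I10", "is_main": False},
    {"key_id": "B002", "dx": "J20", "is_main": True},
]


def link_by_key(statements, detail_rows, field):
    """Attach every detail row that shares a key ID to its billing statement."""
    grouped = defaultdict(list)
    for row in detail_rows:
        grouped[row["key_id"]].append(row)
    # Statements without detail rows get an empty list rather than being dropped.
    return [{**s, field: grouped.get(s["key_id"], [])} for s in statements]


linked = link_by_key(table20, table40, "diagnoses")
for statement in linked:
    codes = [d["dx"] for d in statement["diagnoses"]]
    print(statement["key_id"], statement["patient_id"], codes)
```

The same one-to-many pattern would apply when attaching Table 30 or Table 53 rows to a statement, since those tables also carry the key ID.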

Strengths and weaknesses 

The Patient Samples have several important strengths. First, they are representative of the total patient population in South Korea, which supports generalization to the population. Second, the Patient Samples contain comprehensive yet specific information on healthcare services, including prescription drugs, provided under the fee-for-service system. Third, because the Patient Samples passed validity tests [1], they have been shown to estimate population values efficiently, saving cost and time in conducting research. HIRA signed memoranda of understanding (MOU) with five academic societies in healthcare services (Korean Society for Preventive Medicine, Korean Association of Health Economics and Policy, Korean Society of Health Information and Health Statistics, Korean Academy of Health Policy and Management, and Korean Society of Epidemiology) and performed tests to validate the Patient Samples.

Table 3. Comparison between Patient Samples and the actual population (unit: person)

Sample type                       Sample size (%)   Estimated population  Actual population
HIRA-National Patient Sample      1,375,842 (3)     45,861,321            47,026,505
HIRA-National Inpatient Sample    765,564 (13)      5,888,921             6,026,063
HIRA-Adult Patient Sample         1,073,183 (19)    5,365,917             5,650,511
HIRA-Pediatric Patient Sample     1,026,648 (10)    10,266,474            10,681,503

The percentage is that of the total patients in claims data.
Based on data from 2011.

Table 4. Main variables in the Patient Samples

Table 20 (general specification)
    Billing statement identification code (key ID), patient ID, provider's ID, stratification variables, strata, age, gender, sample weight, DRG billing number, claims types, date of admission, insurance type, hospital arrival pathway, major diagnosis, secondary diagnosis, injury from public service, days of care, initial date of care, final date of care, days in hospitalization, payer's amount, patient's out-of-pocket cost, total amount, surgical status

Table 30 (healthcare services)
    Billing statement identification code (key ID), service category, classification type, unit price, total price, daily dosages, days of supply, quantity of supply, service codes, drug codes

Table 40 (diagnosis information)
    Billing statement identification code (key ID), indicator for major diagnosis, department, diagnosis

Table 53 (outpatient prescriptions)
    Billing statement identification code (key ID), classification type, unit price, total price, daily dosages, days of supply, quantity of supply, service codes, drug codes

Table of Providers
    Provider ID, type of providers, presence of special equipment (CT, MRI, PET), location, number of beds, number of staff per 50 beds (physicians, dentists, acupuncturists, and nurses)

IDs are replaced with alternative IDs to protect private information.

Among the major test results, "Korean prevalence of diabetes and evaluations of dipeptidyl peptidase-4 inhibitor use" demonstrated that the estimate of diabetes prevalence from the sample data aligned with that from analysis of the full population. Moreover, the estimate of hypoglycemic agent prescriptions aligned with the results of the population analysis. Additionally, the outpatient prescription rates of each hypoglycemic agent were all within the 95% confidence interval. In the test of "The burden of social cost estimates of diseases associated with vision loss and blindness," the Patient Samples and the population both exhibited higher health service use in female patients with major eye diseases (cataracts, glaucoma, macular degeneration, diabetic changes in the retina, and retinal vein occlusion) than in male patients. Furthermore, the Patient Samples and the population exhibited similar age trends with respect to health service use in cases of cataracts, glaucoma, and macular degeneration.

Despite the strengths described above, a few limitations need to be noted by researchers interested in using the Patient Samples. First, the accuracy of diagnoses has been an issue because claims data are collected for the purpose of reimbursing healthcare services, not for clinical purposes. Hence, diagnosis information in claims data may be subject to up-coding by providers seeking a higher reimbursement rate, or a diagnosis may remain in the data even after it has been ruled out by lab tests. This implies that a patient recorded with a certain diagnosis does not necessarily have the corresponding disease. Inaccurate diagnosis information is a problem not only for the claims data of HIRA but also for most other claims data, although the problem can be more serious in the claims data of HIRA because of the fee-for-service system and reimbursement policies.

One study shows that diagnoses in the claims data of HIRA tend to be more accurate for severe diseases than for frequently occurring mild diseases [2]. In addition, they are more accurate in inpatient than in outpatient settings, and in hospitals than in clinics [2]. To address the inaccuracy of diagnosis information, researchers use operational definitions to identify patients with a certain disease rather than relying on the diagnosis alone [3-5]. Second, the sample data may not have sufficient cases for rare diseases or for age groups with low frequency. Third, socioeconomic characteristics and risk factors such as the patient's income, education, location, weight, height, and mortality, and health behaviors such as smoking, drinking, and amount of exercise, are missing, which limits the ability to conduct thorough research. HIRA has projects to link data from other institutions to enrich the patient socioeconomic and risk-factor variables. Finally, the Patient Samples are cross-sectional, making a longitudinal study that follows the same individual patient over years impossible. To protect patient privacy, the patient ID and the healthcare service provider ID are replaced with alternative IDs so that patients and providers are not identifiable. Sensitive information that could lead to the identification of a patient, such as rare diseases and legally designated infectious diseases, was also removed from the data.
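
The operational definitions mentioned above can take many forms. As one hedged illustration, the Python sketch below counts a patient as a case only when the claims contain both a matching diagnosis code and a matching prescription, so that a rule-out diagnosis without treatment is not counted. This specific rule, the function name, and the drug code `DRUG_A` are assumptions for illustration and are not the definitions used in the cited studies; `E11` is the ICD-10 category for type 2 diabetes mellitus.

```python
def meets_operational_definition(claims, dx_prefixes, drug_codes):
    """True if any claim carries a matching diagnosis AND any claim a matching drug.

    Assumed rule for illustration only: diagnosis plus prescription.
    """
    prefixes = tuple(dx_prefixes)
    has_dx = any(dx.startswith(prefixes) for c in claims for dx in c["diagnoses"])
    has_rx = any(drug in drug_codes for c in claims for drug in c["drugs"])
    return has_dx and has_rx


# Hypothetical claims grouped by (alternative) patient ID.
claims_by_patient = {
    "P1": [{"diagnoses": ["E11"], "drugs": ["DRUG_A"]}],  # diagnosis and drug
    "P2": [{"diagnoses": ["E11"], "drugs": []}],          # diagnosis only
    "P3": [{"diagnoses": ["J20"], "drugs": ["DRUG_A"]}],  # drug only
}

cases = sorted(
    pid
    for pid, claims in claims_by_patient.items()
    if meets_operational_definition(claims, {"E11"}, {"DRUG_A"})
)
print(cases)
```

Only the first patient satisfies both conditions here; stricter or looser rules (for example, requiring repeated visits) would change who is counted, which is why the definition should be reported alongside the results.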

DATA ACCESSIBILITY 

The Patient Samples can be obtained via the HIRA website by filling out the End User Agreement of the Patient Samples. The Patient Samples are provided on DVD (text file format), and the samples are subject to a fee.

Go to HIRA website -> government 3.0 information open -> application for the use of medical information -> sample data

http://www.hira.or.kr/dummy.do?pgmid=HIRAA070001000312&cmsurl=/cms/open/02/01/02/index.html

Tel: 02-2182-2601

E-mail: kshyun84@hiramail.net

AVAILABILITY OF DATA DICTIONARY IN ENGLISH 

HIRA is currently working on the data dictionary in English.

Supplementary material is available at http://www.e-epih.org/.